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Abstract 

1 We consider supervised learning problems within the positive-definite kernel framework, 

I 1 . such as kernel ridge regression, kernel logistic regression or the support vector machine. With 

^ | kernels leading to infinite-dimensional feature spaces, a common practical limiting difficulty 

O ■ is the necessity of computing the kernel matrix, which most frequently leads to algorithms 

with running time at least quadratic in the number of observations n, i.e., 0(n 2 ). Low-rank 
approximations of the kernel matrix are often considered as they allow the reduction of running 
J> 1 time complexities to 0(p 2 n), where p is the rank of the approximation. The practicality of such 

. methods thus depends on the required rank p. In this paper, we show that for approximations 

based on a random subset of columns of the original kernel matrix, the rank p may be chosen to 
be linear in the degrees of freedom associated with the problem, a quantity which is classically 
used in the statistical analysis of such methods, and is often seen as the implicit number of 
parameters of non-parametric estimators. This result enables simple algorithms that have sub- 
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" quadratic running time complexity, but provably exhibit the same predictive performance than 

existing algorithms. 

^ 1 Introduction 

Kernel methods, such as the support vector machine or kernel ridge regression, are now widely 
used in many areas of science and engineering, such as computer vision or bioinformatics (see, 
e.g., H1I21 IH). Their main attractive features are that (1) they allow non-linear predictions through 
the same algorithms than for linear predictions, owing to the kernel trick; (2) they allow the sepa- 
ration of the representation problem (designing good kernels for non-vectorial data) and the algo- 
rithmic/theoretical problems (given a kernel, how to design, run efficiently and analyze estimation 
algorithms). Moreover, (3) their applicability goes beyond supervised learning problems, through 
the kernelization of classical unsupervised learning techniques such as principal component analysis 
or K-means. Finally, (4) probabilistic Bayesian interpretations through Gaussian processes allow 
their simple use within larger probabilistic models. For more details, see, e.g., O EH 121 - 

However, kernel methods typically suffer from at least quadratic running-time complexity in the 
number of observations n, as this is the complexity of computing the kernel matrix. In large- 
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scale settings where n may be large, this is usually not acceptable. In these situations where plain 
kernel methods cannot be run, practitioners would commonly (a) turn to methods such as boosting, 
decision trees or random forests, which have both good running time complexity and predictive 
performance. However, these methods are typically run on data coming as vectors and usually put a 
strong emphasis on a sequence of decisions based on single variables. Another common solution is 
(b) to stop using infinite-dimensional kernels and restrict the kernels to be essentially linear kernels 
(i.e., by choosing an explicit representation of the data whose size is independent of the number 
of observations) where the non-parametric kernel machinery (of adapting the complexity of the 
underlying predictor to the size of the dataset) is lost, and the methods may then underfit. 

In this paper, we consider the traditional kernel set-up for supervised learning, where the input 
data are only known through (portions of) the kernel matrix. The main question we try to tackle 
is the following: Is it possible to run supervised learning methods with positive-definite kernels in 
time which is subquadratic in the number of observations without losing prediction performance? 
Of course, if adaptation is desired, linear complexity seems impossible, and therefore we should 
expect (hopefully slightly) super-linear algorithms. Statistically, a quantity that characterizes the 
non-parametric nature of kernel method is the degrees of freedom, which play the role of an implicit 
number of parameters and which we define in Section |4~T1 Does it play a role in the computational 
properties of kernel methods? 

An important feature of kernel matrices is that they are positive-semidefinite, and thus they may well 
be approximated from a random subset of p of their columns, in running-time complexity 0{p 2 n) 
and with a computable bound on the error (see details in Section [3]). This appears through different 
formulations within numerical linear algebra or machine learning, e.g., Nystrom method [6j, sparse 
greedy approximations 0, incomplete Cholesky decomposition (H, Gram-Schmidt orthonormal- 
ization [2] or CUR matrix decompositions |9). It has been studied a lot @[lOl[TT|], i n contexts where 
the goal is kernel matrix approximation or approximate eigenvalue decomposition. Such bounds 
have also been subsequently used to characterize the approximation of predictions made from these 
low -rank decompositions lTT2l[T3l . but this two-stage analyses do not lead to guarantees that reflect 
the good observed practical behavior. In this paper, our analysis aims at answering explicitly the 
simple question: how big should p be to incur no loss of predictive performance compared to the 
full kernel matrix? The key insight of this paper is not to try to approximate the kernel matrix well, 
but to predict well from the approximation. This requires a sharper analysis of the approximation 
properties of the column sampling approach. 

We make the following contributions: 

- In the least-squares regression setting, we show in Section l4~2l that the rank p can be chosen to 
be linear in the degrees of freedom associated with the problem. 

- We present in Section |4~41 simple algorithms that have sub-quadratic running time complexity, 
and, for the square loss, provably exhibit the same predictive performance as classical algo- 
rithms than run in quadratic time (or more). 

- We provide in Section R31 explicit examples of optimal values of the regularization parameters, 
as a function of the decay of the eigenvalues of the kernel matrix, shedding some light in the 
joint computational/statistical trade-offs for choosing a good kernel. In particular, we show that 
with kernels with fast spectrum decays (such as the exponential or Gaussian kernels), com- 
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putational limitations may prevent exploring the relevant portions of the regularization paths, 
leading to underfitting. 



2 Supervised learning with positive-definite kernels 

In this section, we present the problem we try to solve, as well as several areas of the machine 
learning and statistics literatures our method relates to. 

2.1 Equivalent formulations 

Let (xi, yi), i = 1, . . . , n, be n pairs of points in X x y, where X is the input space, and y is the 
set of outputs/labels. In this paper, we consider the problem of minimizing 

1 71 \ 

mm-J2t(yiJ(xi)) + ^\\ff, (D 
/eJ 7 n 2 
i=i 

where J 7 is a reproducing kernel Hilbert space with feature map <fi : X — > F, and positive-definite 
kernel k : X X X — > R. While this problem is formulated as an optimization problem in a Hilbert 
space, it may be formulated as the optimization over W 1 in two different ways. 

First, using the representer theorem (see, e.g., 11112), the unique solution / may be found as / = 
Z~2?=i OLi<j){xi). Thus, by replacing the expression of / in Eq. (GQ), a is a solution of the following 
optimization problem: 

1 n A 
min - V £{y h (Ka)i) + -a T Ka, (2) 
aelR™ n ^-^ 2 

i=l 

where K £ M nxn is the kernel matrix, defined as Ky = k(xi,Xj). 

Second, for convex losses only, an equivalent dual problem is classically obtained as (see proof in 
the appendix): 

max — g(— Xa) a T Ka, (3) 

where g(z) = max ug ]s» —- Yli=i ^{Vii u i) + u i z i i s trie Fenchel-conjugate of the empirical risk 
(for the hinge loss, Eq. ([3]) is exactly the classical dual formulation of the SVM). Again, one may 
express the primal solution as / = z~27=i a i4'( x i)- I n many situations (such as with the square loss 
or logistic loss), then the solution of Eq. is unique, and it is also a solution of Eq. (O (note 
however that the converse is not true). 



2.2 Related work 

Efficient optimization algorithms for kernel methods. In order to solve Eq. (OQ), algorithms typ- 
ically consider a primal or a dual approach. Solving Eq. © (i.e., primal formulation after applica- 
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tion of the representer theorem) is typically inefficient because the problem is ill-conditioned..] and 
thus second-order algorithms are typically used lfl4l . Alternatively, K is represented explicitly as 
K = <3?<E> T and a change of variable w = <£ T a is considered (note that when the kernel k is linear, 
is simply the design matrix, and we are solving directly a linear supervised learning problem). 
Then, the classical battery of convex optimization algorithms may be used, such as gradient de- 
scent, stochastic gradient descent lfl31 or cutting-planes ifloll . However, in a kernel setting where a 
small matrix <E> (i.e., with few columns) is not known a priori, then they all exhibit at least quadratic 
complexity in n, as the full kernel matrix is used. 

The dual problem in Eq. Q is usually better-behaved (it has a better condition number) 1141 . and 
algorithms such as coordinate descent and its variants such as sequential minimal optimization may 
be used f 171. Again, the full kernel matrix is needed. 

Some algorithms do not need to compute the full kernel matrix, such as the "forgetron" lPT8l or the 
"projectron" |fl9ll , which operate online on a fixed budget and come with theoretical approximation 
guarantees; however, these do not characterize the required rank which is needed to achieve the 
same accuracy than the problem with a full kernel matrix. In fact, one of the main motivations for 
this work is to derive precise bounds for reduced-set stochastic gradient algorithms for supervised 
kernel problems. 

Analysis of column sampling approximation. Given a positive semi-definite matrix K of size n, 
many methods exist for approximating it with a low-rank (typically also positive semidefinite) ma- 
trix L. While the optimal approximation is obtained from the eigenvalue decomposition, it is not 
computationally efficient as it has complexity at least quadratic in n (as it requires the knowledge 
of K). In order to achieve linear complexity in n, approximations from subsets of columns are 
considered and appear under many names: Nystrom method 10, sparse greedy approximations Q, 
incomplete Cholesky decomposition |8], Gram-Schmidt orthonormalization J2) or CUR matrix de- 
compositions [91. Note that reduced-set methods (see, e.g., ll20l ) typically consider using a subset of 
columns after the predictor has been estimated. These low-rank methods are described in Section [3] 
and have running time complexity 0{p 2 n) for an approximation of rank p. Note that they may also 
be used in a Bayesian setting with Gaussian processes (see, e.g., II2TI ). 

Column sampling has been analyzed a lot (9j [I2j [TTJ ; however, typically the analysis provides a 
bound on the error \\K — L\\ for an appropriate norm (typically operator, Frobenius or trace norm), 
but this is too pessimistic and does not really match with good practical performance (see more 
details in Section l4~2l ). Some works do consider prediction guarantees lfl2l [T3l . but as shown in 
Section 14.21 these are not sufficient to reach sharp results depending on the degrees of freedom. 
Moreover, many analyses consider situations where the matrix K is close to low-rank, which is not 
the case with kernel matrices. In this paper, the control of K — L is more precise and adapted to the 
use of K within a supervised learning method. 

Dimension reduction for linear predictions. The method presented in this paper, which considers 
random columns from the original kernel matrix, is also related to random projection techniques 
used for linear prediction problems (see, e.g., ll22l l23l ). These techniques require however the 
knowledge of a matrix square root <I> (such that K = $<£ T ), which leads to complexity greater than 

'The objective function in Eq. ((2) is a function of K x ^ 2 a, with a kernel matrix K which is often ill-conditioned, 
usually leading to ill-conditioning of the original problem 1141 . 
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quadratic, and then consider replacing it by where A is a random matrix with fewer columns 
than rows. 

Theoretical analysis of predictive performance of kernel methods. In order to assess the re- 
quired precision in approximating the kernel matrix, it is key to understand the typical predictive 
performance of kernel methods. For the square loss, this is classically obtained from a bias-variance 
decomposition of this performance (see Section [4]). A key quantity is the degrees of freedom, which 
play the role of an implicit number of parameters and is applicable to many non-parametric es- 
timation methods which consists in "smoothing" the response vector by a linear operator (see, 
e.g., [24l|25]|26l|27l). See precise definitions in Section|4~T] 

3 Approximation from subset of columns 

Approximation from columns. Given a random subset I of V = {1, ... ,n} of cardinality p, 
we simply consider the approximation of the kernel matrix K from the knowledge of K(V, I) (the 
columns of K indexed by /), by the matrix 

L = K(V, I)K(I, I) j K(I, V) = k(x v , xi)k( Xl , x / ) t A;(x / , x v ), (4) 

where M> denotes the pseudo-inverse of M, and k(xA, xb) denotes the \A\ x \B \ matrix composed 
of elements k(xi,Xj) for G A x B. This corresponds to creating an explicit feature map of 
dimension p, i.e., <p(x) = k(xi, xi)~ 1 ^ 2 k(xj, x) £ W, and, this allows the application to test data 
points (note that using such techniques also allows better testing running time peformance). 

Given the true feature map <p(x) £ T , we have cf>(x) = k(xj,xj)~ 1 ' 2 (p(xi) T 4>(x) € W, and thus 
we simply perform a linear dimension reduction. Given the fact that we consider random subsets / 
of size p, this is similar to a random projection, but here the randomness is associated to the specifics 
of the kernel problem. 

Such a feature map may be efficiently obtained in running time 0(p 2 n) using incomplete Cholesky 
decomposition (often interpreted as partial Gram-Schmidt orthonormalization O), with the possi- 
bility of having a bound on the trace norm of the approximation error (see, e.g., 0). 

Pivoting vs. random sampling. While selecting a random subset is computationally efficient, it 
may not lead to the best performance. For the task of approximating the kernel matrix, algorithms 
such as the incomplete Cholesky decomposition with pivoting, provide an approximate greedy al- 
gorithm with the same complexity than random subsampling (TJUl. 

In Section[5l we provide comparisons between the two approaches, showing the potential advantage 
of the greedy method over random subsampling. However, the analysis of such algorithms is harder, 
and, to the best of our knowledge, still remains an open problem. 

4 Fixed design analysis for least-square regression (ridge regression) 

To simplify the analysis, we assume that the n data points xi,...,x n are deterministic and that 
y = R. In this setting, the classical generalization error (prediction error on unseen data points) 
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is replaced by the in-sample prediction error (prediction error on observed data points). This fixed 
design assumption could be relaxed by using tools from ll27Tl (results for random design settings are 
typically similar to the fixed design settings). 

We assume that the loss I is the square loss, i.e., £(yi,f(xi)) = ^(yi — f(xi)) 2 . By using the 
representer theorem (see Section |2J}, we classically obtain: 

n 

f(x) = <Xik(x, Xi) with a = {K + nXI)~ 1 y. 
i=i 

This leads to a prediction vector z = K(K + nXI)~ l y € R n , which is a linear function of the 
output observations y, and is often referred to as a smoothed estimate of z. 

4.1 Analysis of the in-sample prediction error 

We denote by Zi = Eyj G R the expectation of y.j, and we denote by £j = y% — Zi = yi— Ej/j G R 
the noise variables; they have zero mean and are only assumed to have finite covariance matrix C 
(note that the noise may neither be independent nor identically distributed). 

Bias/variance decomposition of the generalization error. Following classical results from the 
statistics literature (see, e.g., Il24ll25ll26l ), we obtain the following expected prediction error: 

±E e ||z-z|| 2 = ±||E e i-z|| 2 + ±trvar e (z) 

= ±||(J - K[K + n\iy x )zf + ± tr CK 2 (K + nXI)- 2 
= n\ 2 z T (K + n\I)- 2 z + ± tr CK 2 (K + n\I)~ 2 , 

which may be classically decomposed in two terms: 

bias(Tf) = n\ 2 z 1 (K + n\iy 2 z 
variance(Tf) = ± tr CK 2 (K + n\I)- 2 . 

Note that the bias term is a matrix-decreasing function of Kj X (and thus an increasing function 
of A), while the variance term is a matrix-increasing function of K/X and the noise covariance 
matrix C. 

Degrees of freedom. Note that an assumption which is usually made is C = a 2 1, and the variance 
term then takes the form a 2 tr K 2 (K + nXI)~ 2 and tr K 2 (K + nXI)~ 2 is referred to as the degrees 
of freedom |[24l|25ll26ll27l (note that an alternative definition is often used, i.e., tr K(K + nA7) _1 , 
and that as shown in the appendix, they behave similarly). In ordinary least-squares estimation from 
d variables, the variance term is equal to a 2 d/n, and thus the degrees of freedom play the role of 
an implicit number of parameters. In this paper, we show that a proxy to this statistical quantity 
also plays a role in optimization: the number of columns needed to approximate the kernel matrix 
precisely enough to incur no loss of performance is linear in the degrees of freedom. 

More precisely, we define the maximal marginal degrees of freedom d as 

d = n||diag(7r(7^ + nA7)- 1 )|| oo . (5) 
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We have tvK 2 (K + nXI)~ 2 ^ trK{K + nXI)' 1 = || diag (K(K + nXiy 1 )^ d, and 
thus d provides an upper-bound on the regular degrees of freedom. It may be significantly larger in 
situations where there may be outliers and the vector diag (iT(iv" + nA/) -1 ) is far from uniform. 



4.2 Predictive performance of column sampling 



We consider sampling p columns (without replacement) from the original n columns. We consider 
the column sampling approximation defined in Eq. (@]) and provide sufficient conditions (a lower- 
bound on p) to obtain the same predictive performance than with the full kernel matrix. 

Theorem 1 (Generalization performance of column sampling) Assume z G IR n and K G ]R nxn 

are respectively a deterministic vector and a symmetric positive semi-definite matrix, and A > 0. 
Let d = tt, 1 1 diag (iT(i^ + ?iA/) _1 ) || and R 2 = || diag(iv")||oo. Assume e E M. n is a random vector 
with finite variance and zero mean, and define the smoothed estimate zk = (K + nXI)~ 1 K(z + 
e). Assume that I is a uniform random subset of p indices in {1, ... ,n} and consider L = 
K(V, I)K(I, I)* 1 K(I, V), with the approximate smoothed estimate zl = (L + nXI)~ 1 L(z + e). 
Let 5 e (0, 1). // 

^ ,32d n x . nR 2 

(— + 2 )lo g7r , (6) 

then 

-EjEJIzl - z\\ 2 < -(1 + 45)E £ \\z K -z\\ 2 . (7) 
n n 



We can make the following observations: 



- The bound in Eq. © provides a relative approximation guarantee: the predictions zl are shown 
to perform as well as zk (no kernel matrix approximation). Small values of 5 impose no loss 
of performance, while 5 = 1/4 impose that the prediction errors have a similar behavior (up to 
a factor of 2). 

- The lower bound for the rank p in Eq. Q shows that the maximal marginal degrees of freedom 
provides a quantity which, up to logarithmic terms, is sufficient to scale with, in order to incur 
no loss of prediction performance. Note that the previous result also allows the derivation of an 
approximation guarantee 5 given a rank p, by inverting Eq. ©. 

- The bound in Eq. (O provides a result in expectation, both with respect to the data (i.e., E e ) 
and the sampling of columns (i.e., ¥,j). While results in high-probability with respect to / 
are readily obtained (in fact, the proof is based on such results), doing so with respect to e 
would require additional assumptions, which are standard in the analysis in ridge regression 
(see, e.g., Il27ll28l ). but that would make the results significantly more complicated. 

- Theorem[T]shows that in the specific instance that we are faced with, we do not lose any average 
predictive performance. This is different than achieving a good approximation of the kernel 
matrix (91 • Previous work lfT2l[T3l considers explicitly the use of kernel matrix approximation 
bounds within classifiers or regressors, but obtain bounds that involve multiplicative terms of the 
form 1/A 2 , which, as we show in Section l4~3l would grow as n grows. Our proof technique, that 
focuses directly on prediction performance, avoids this, and our dependence is only logarithmic 
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in A (see details in the proof in the appendix). Finally, as opposed to |[T3l . the bound is not on 
the worst-case predictive performance (obtained from optimizing over A, and with worst-case 
analysis over K), but for given A and K. 

- Theorem[T]provides a sufficient lower-bound for the required rank p. Deriving precise necessary 
lower-bounds is outside the scope of this paper. However, given that with a reduced space of 
p dimensions, we can achieve a prediction error of 0(p/n) from ordinary least-squares, we 
should expect p to be larger than the known minimax rates of estimation for the problem at 
hand (see, e.g., 1291 ). In Section 1431 we show that in some situations, it turns out that d is of 
the order of the minimax rate; therefore, we could expect that in certain settings, d is also a 
necessary lower-bound on p (up to constants and logarithms). 

- In the existing analysis of sampling techniques for kernel methods, another source of ineffi- 
ciency which makes our result sharper is the proof technique for bounding \\K — L\\. Indeed, 
most analyses use a linear algebra lemma from (9l[T0l, that relies on the (p+ l)-th eigenvalue to 
be small; hence it is adapted to matrices with sharp eigenvalue decrease, which is not the case 
for kernel matrices (see an illustrative example in the appendix, where the kernel approximation 
error decays much slower than the prediction error as the rank p increases). We provide a new 
proof technique based on regularizing the column sampling approximation and optimizing the 
extra regularization parameter using a monotonicity argument. 

- In our experiments, we have noticed that the low-rank approximation may have an additional 
regularizing effect leading to a better prediction performance than with the full kernel matrix. 

4.3 Optimal choice of the regularization parameter 

For simplicity, in this section, we assume that the noise variables e are i.i.d. (i.e., C = a 2 I). Our 
goal is to study simplified situations, where we can derive explicit formulas for the bias, the variance, 
and the optimal regularization parameter. Throughout this section, we will consider specific decays 
of certain sequences, which we characterize with the notation u n = @(v n ), which means that there 
exist strictly positive constants A and B such that Au n ^ v n ^ Bv n for all re. 

We assume that the kernel matrix K has eigenvalues of the form 9(re//j), i = 1, . . . , re, for some 
summable sequence (/ij) — so that tr K = G(n), and that the coordinates of z on the eigenbasis of K 
have the asymptotic behavior Q(^/nui) for a summable sequence (z^) — so that ^z T z = 0(1). In 
Table [T] we provide asymptotic equivalents of all quantities for several pairs of sequences (/ij) and 
(i>i) (see proofs in the appendix), with polynomial or exponential decays. 

Note that for decays of v\ which are polynomial, i.e., V{ = 0(i~ 2S ), then the best possible prediction 
performance is known to be C^n 1 / 25-1 ) |29l and is achieved if the RKHS is large enough (lines 2 
and 4 in Tabled). For exponential decay, the best performance is 0(log n/n). 

Given a specific decay (z/j) for expected outputs z = Ky, then depending on the decay (//j) of the 
eigenvalues of the kernel matrix, the final prediction performance and the optimal regularization 
parameter may be different. Usually, the smaller the RKHS, the faster the decay of eigenvalues of 
the kernel matrix K (this is true for translation-invariant kernels HI, and the kernels considered in 
Section [5]). Thus there are two regimes: 
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(Mi) 


fa) 


var. 


bias 


optimal A 


pred. perf. 


deg. freed, d 


condition 


i-2/J 
j-2/J 
j-2/J 
e -pi 
e -pi 
e -pi 


r 25 
r 25 

e -Ki 

e -Ki 
e -Ki 


rT l log j 
n _1 log j 
rT l log t 


A 2 

A (25-l)/2/3 

A 2 

(logi) 1 " 25 
A 2 


^-1/(2+1/2/3) 
n -f>/6 

^-1/(2+1/2/3) 

exp(-n 1 /( 25 )) 
n -i/2 

n -p/« 


n l/(4/3+l)-l 

n V(2*)-l 
n l/(4/9+l)-l 
n l/(2*)-l 
log n/n 
log n/n 


n l/(4/3+l) 
n l/(25) 
n l/(4/?+l) 
n l/(25) 

log n 
log n 


if 25 > 4/3 + 1 
if 25 < 4/3 + 1 

if k > 2p 
if k < 2p 



Table 1: Variance, bias, optimal regularization parameter and corresponding prediction perfor- 
mance, for several decays of eigenvalues and signal coefficients (we always assume 5 > 1/2, 
/3 > 1/2, p > 0, k > 0, to make the series summable). All entries are functions of i, n or A and are 
only asymptotic bounded below and above, i.e., corresponding to the asymptotic notation 0(-). 



- The RKHS is too large, for lines 1 and 3 in Table [T] the eigenvalues of K, which depend 
linearly on //j, do not decay fast enough. In other words, the functions in the RKHS are not 
smooth enough. In this situation, the prediction performance is suboptimal (do not attain the 
best possible rate). 

- The RKHS is too small, for lines 2, 4, and 6 in Table [T] the eigenvalues of K decay fast 
enough to get an optimal prediction performance. In other words, the functions in the RKHS 
are potentially smoother than what is necessary. In this situation however, the required value 
of A may be very small (much smaller than 0(n -1 )), leading to potentially harder optimization 
problems (since the condition number that depends on 1/A may be very large). 

There is thus a computational/statistical trade-off: if the RKHS is chosen too large, then the pre- 
diction performance is suboptimal; if the RKHS is chosen too small, the prediction performance 
could be optimal, but the optimization problems are harder, and sometimes cannot be solved with 
the classical precision of numerical techniques (see examples of such behavior in Section [5]). 

4.4 Optimization algorithms with column sampling 

Given a rank p and a regularization parameter A, we consider the following algorithm to solve 
Eq. ([]]) for twice differentiable convex losses: 

1. Select at random p columns of K (without replacement). 

2. Compute $ G R nxp such that $<I> T = K(V,I)K(I,rf K(I,V) using incomplete Cholesky 
decomposition (see details in El). 

3. Minimize min^gKP ^ Ya=i ^(^> (® w )i) + ^IMP using Newton's method (i.e., a single linear 
system for the square loss). 

The complexity of step 2 is already 0(p 2 n), therefore using faster techniques for step 3 (e.g., accel- 
erated gradient descent) does not change the overall complexity, which is thus 0(p 2 n). Moreover, 
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since we use a second-order method for step 3, we are robust to ill-conditioning and in particular to 
small values of A (though not below machine precision as seen in Section©. This is not the case for 
algorithms that relies on the strong convexity of the objective function, whose convergence is much 
slower when A is small (as seen in Section I4.3L when n grows, the optimal value of A can decay 
very rapidly, making these traditional methods non robust). 

According to Theorem [T] at least for the square loss, the dimension p may be chosen to be linear 
in the degrees of freedom d, which, as illustrated in Table [T] is typically smaller than n 1 / 2 (d is of 
the order of the prediction performance multiplied by n). Therefore, if p is properly chosen, the 
complexity is subquadratic. Given A, d (and thus p) can be estimated from a low-rank approxima- 
tion of K. However, our current analysis assumes that A is given. Selecting the rank p and the 
regularization parameter A in a data-driven way would make the prediction method more robust, but 
this would require extra assumptions (see, e.g., |[28l and references therein). 

5 Simulations 

Synthetic examples. In order to study various behaviors of the regularization parameters A and 
the degrees of freedom d, we consider periodic smoothing splines on [0, 1] and points x±, . . . , x n 
uniformly spread over [0, 1], either deterministically or randomly. In order to generate problems 
with given sequences (//j) and (y \), it suffices to choose k(x, y) = ^Mi cos 2i7r(x — y), and a 
function f(x) = YliLi 2^ 2 cos 2i-Kx. For ^ = i~ 2/3 , we have k(x, y) = j^yB2/3(x—y—[x—y\), 
where i?2/3 is the (2/3)-th Bernoulli polynomial (see details in the appendix). 

Optimal values of A. In a first experiment, we illustrate the results from Section |4~3l and compute in 
Figure [T]the best value of the regularization parameter (left) and the obtained predictive performance 
(middle), for a problem with v = i~ 2S for 5 = 8, and for which we considered several kernels, for 
which m = i' 2 ?, for (3 = 1, = 4 and = 8. We can make the following observations: 

- For = 1, the rate of convergence of ?i 1 /( 4 ' 3 + 1 ) _1 is achieved (line 1 in TableQ}, with a certain 
asymptotic decay of the regularization parameter, and it is slower than n 1 ' ( 25 ) _1 . 

- For /3 = 4, the optimal rate of 

n l/(25)-l is 

achieved (line 2 in Tabled), as expected. 

- For j3 = 8, the rate of convergence should be ?t, 1 /( 2<5 ) _1 (line 2 in Table [T]), however, as seen in 
the left plot, the regularization parameter saturates as n grows at the machine precision, leading, 
because of numerical errors, to worse prediction performance. The problem is so ill-conditioned 
that the matrix inversion cannot be algorithmically robust enough. 

Performance of low-rank approximations. In this series of experiments, we compute the rank p 
which is necessary to achieve a predictive performance at most 1% worse than with p = n, and 
compute^ the ratio with the marginal degrees of freedom d = n|| diag (K(K + nA/) -1 ) || and 
the traditional degrees of freedom d ave = tvK 2 (K + n\I)~ 2 . In the right plot of Figure [Q we 
consider data randomly distributed in [0, 1] with the same kernels and functions than above, while 

2 Note that in practice, computing the degrees of freedom exactly requires to know the full matrix. However, it can 
also be approximated efficiently using a low-rank approximation based on column sampling. 
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Figure 1: Left and middle: Effect of size of RKHS in predictive performance. Right: Ratio of 
the sufficient rank to obtain 1% worse predictive performance, over the degrees of freedom (plain: 
random column sampling, dashed: incomplete Cholesky decomposition with column pivoting). 




2 2.5 3 3.5 2 2.5 3 3.5 2 2.5 3 3.5 

log l0 (n) lo 9i (n) log io (n > 



Figure 2: Ratio of the sufficient rank to obtain 1% worse predictive performance, over the degrees 
of freedom (plain: random column sampling, dashed: incomplete Cholesky decomposition with 
column pivoting). From left to right: pumadyn datasets 32fh, 32nh, 32nm. 

in Figure El we considered three of the pumadyn datasets from the UCI machine learning repository 
(here we compute the classical generalization performance on unseen data points). We can make 
the following observations: 

- On all datasets, the ratios stay relatively close to one, illustrating the results from Theorem [TJ 

- Using pivoting to select the columns does not change significantly the results, but may some- 
times reduce the number of required columns by a constant factor. 

6 Conclusion 

In this paper, we have provided an analysis of column sampling for kernel least-squares regression 
that shows that the rank may be chosen proportional to the degrees of freedom of the problem, 
showing that the statistical quantity characterizing prediction performance also plays a computa- 
tional role. The current analysis could be extended in various ways: First, other column sampling 
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schemes beyond uniform, such as presented in iflQlfTTTl . could be considered with potentially better 
behavior; the analysis may also be extended to other losses than the square loss, such as the logistic 
loss, using self-concordant analysis 11301 . Finally, in this paper, we have considered a batch set- 
ting and extending these results to online settings is of significant practical and theoretical interest. 
In particular, the difficulty would be to study regularization parameters that adapt to the number 
of observations, which lead to vanishing strong convexity constants and does not allow rates of 
order 0(n _1 ). 
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A Duality for kernel supervised learning 

We consider the following problem, where F is an RKHS with feature map cf> : X ^ F: 

A 



4 = 1 

which may be rewritten with the feature map (f> as: 

1 n A 
min n - ^l{yi,Ui) + -||/|| 2 such that u { = (f,<f>(xi)}. 

' i=i 

We may then introduce dual parameters (Lagrange multipliers) a G W 1 and the Lagrangian 

-, n , n 

£(/, u,a) = -J2 KVi^i) + 2 ll/f + A E ^ " (/' 

71 i=l i=l 

Minimizing with respect to (/, u), we get / = Y17=i a i4>( x i) an d the dual problem: 

max — q{— Xa) a T Ka, 

where, for z G M n , g(z) = max ng ign — I ^™ =1 Uj) + i^z; is the Fenchel-conjugate of the 
empirical risk. 



B Comparison of relative errors of kernel approximation and predic- 
tion performance 

In Figure [3l we consider a prediction problem with n = 400 and a decay of eigenvalues of the 
kernel matrix which is the inverse of a low-order polynomial. We compare the decays to zero of 
the relative kernel matrix approximation \\K — L\\/\\K\\ (for the trace and operator norms) with 
the decay of the relative prediction performance (i.e., prediction for L minus prediction for the full 
matrix K). 

As the rank p increases, the decay of the relative prediction error is much faster than the error in ma- 
trix approximation, suggesting that relying on good kernel matrix approximation may be suboptimal 
if the goal is simply to predict well. 

C Proof of Theorem!] 

We first prove a lemma that provides a Bernstein- type inequality for subsampled covariance matri- 
ces. 
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Figure 3 : Relative average errors (in log-scale) as the rank p grows, for kernel matrix approximation 
1 1 if — L\\ (trace and operator norms) and predictions errors, for a synthetic prediction problem 
described in the experiments section of the main paper for n = 400. The prediction error (red 
curve) stops when the average prediction error of the column sampling approach gets below the 
prediction error of the full kernel matrix approach. 



C.l Concentration of subsampled covariance matrices 

Given the matrix \& £ W lXr and I C {1, . . . , p}, we denote by */ the submatrix of \f composed of 
the rows of \& indexed by I. 

Lemma 1 (Concentration of subsampled covariance) Let ^ G ]R nxr , with all rows of ' l^-norm 
less than R. Let I a random subset of {1, ... ,n} with p elements (i.e., p elements chosen without 
replacement uniformly at random). Then, for all t > 0, 



1/ A 



n p 



> 1 1 ^ r exp 



-pt 2 /2 



A max (i* T ^)( J R 2 + t/3), 
Proof Let ipi, . . . , ip n € W be the n rows of <I>. We consider the matrix A G W xr defined as: 

A = I*t* - n, 7 = lv^ T - - V ^J. 

n rt n — ^ n ^ — ^ 



n p n £ — ' p 

y i=l y iei 



By construction, we have EA = 0, and, as shown in fl3TJ[321 an d 13311 • we have 

Etrexp(sA) < Etrexp(sH), 
where E is obtained by sampling independently p rows with replacement, i.e., is equal to 

%=\ v j=i i=l 

where z J ' S W 1 is a random element of the canonical basis of W 1 such that %(z\ = 1) = - for all 
i € {1, . . . , n} and j G {1, . . . ,p}. This result extends to the matrix case the classical result of 
Hoeffding O. 
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We thus have: 

1 p , 1 n n s p 

^ j = l V i=\ i=l / j=l 

with Mj = i ( £™=i 4(^ T * - V^ T )) ■ We have EM j = °> A max (M,) < A max (±* T #)/p, and 

^ j=l ' P ^3=1 i=l k=l ' 

1 / 1 - 1 \ 1 

= -Amaxl - V (-* T * - Ai>J) 2 J because = -5 i=k 

v i=i ' 

^ i=i ^ i=i 

it! 2 1 

< — A max (-* T #) because ipiipj ip^J 4 R 2 ^J ■ 
p n ' 

We can then apply the matrix Bernstein inequality of (32j Theorem 6.1] to obtain the bound: 

t 2 /2 



r cxp 



— A max (-^ T ^) + -A max (i^ T ^)| 
p n ' p n '6 



which leads to the desired result. 



C.2 Proof of Theorem U 

Proof principle. Let $ G M nxn be such that K = <£<1> T . Note that if K has rank r, we could 
instead choose $ £ M nxr . 

We consider the regularized low-rank approximation L 7 = <J?iV 7 <I> T , with 

iv 7 = $7($ 7 $j + p 7 /)" 1 $ / = $7$/($j$j +pt^)" 1 = / - t(^7^//p + 7/)" 1 (8) 

(using the matrix inversion lemma). We have L = Lq but we will consider L 7 for 7 > to obtain a 
bound for 7 = 0, using a monotonicity argument. 

Following the same reasoning than in Section 4.1 of the main paper, the in-sample prediction error 

^ £ \\z Ll - z\\ 2 is equal to 

-EJIzr -z\\ 2 = n\ 2 \\(®N^ T + n\I)-h\\ 2 + -trC\$N 1 $ T ($N y $ T + n\I)~ 1 }^ 
n n 

= bias(L 7 ) + variance(L 7 ). 

The function 7 h-> iV 7 is matrix-non-increasing (i.e., if 7 ^ 7', then N y =5! iVy). Therefore, we have 
=^ iV 7 iV =<! I. Since the variance term variance(L 7 ) = i tr C[$iY 7 $ T ($iV 7 $ T -l-nAJ) -1 ] 2 
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is non-decreasing in iV 7 , this implies that the variance term with iV 7 is smaller than the one with iVo 
and then less then the one with iV 7 replaced by I (which corresponds to the variance term without 
any approximation). For the bias term we have: 



bias(L 7 ) = 7iA a ||($JV 7 $' + nXiy l zf = nX 2 z J ($iV 7 $ 1 + nXI) 



(9) 



which is a non-decreasing function of 7. Therefore, if we prove an upper-bound on the bias term 
for any 7 > 0, we have a bound for 7 = 0. This requires lower-bounding N 1 . 



Lower-bounding iV 7 . Let * = $(i$ T $ + 7/)" 1 / 2 G R nxn . We may rewrite iV 7 defined in 
Eq. ® as 



/-7(I$J$ / + 7 /)- 1 



P 



' n 
n 



-1/2 



1 

n 



(-$ T $ +7/)- 

n 



1/2 



Thus, in order to obtain a lower-bound on iV 7 , it suffices to have an upper-bound of the form 



n p 



(10) 



which would imply 

7 / 1 



I-iV 7 ^ -^-(-$ T $ + 7/) \ 
1 — t n 

7 



1 



K — = <S>(I - N^)® 1 4 T -^- r $(-$'$ + 7/)- i <I» 



1*T_ ™7 



1-V ' J 1-t 



l-t y n 

Assume ^ 1. We then have, using the previous inequality: 

(L^+nXiy 1 4 {K-^-J+nXiy 1 = (i^+nAfl-^]/)" 1 ^ (1- j/^-^K+nXI)' 1 . 

Thus, the bias term in Eq. © is less than the original bias term times (1 — T77) -2 . If the bound 
defined in Eq. ( fTOl ) is not met, then we can upper-bound the bias term by ^z 1 z, which is itself upper- 
bounded by the unapproximated bias term times (1+ ^-) — indeed, we have nX 2 z T (K+nXI)~ 2 z ^ 



nX 2 z T z(nX + nR 2 ) 



2\-2 _ 1 „T 



z z 



1 + R 2 /X)- 2 . Thus if we define p t = fj(A 



^J^i > tj , then, Ei [bias(L 7 )] is upper-bounded by 

B=p t {l + R 2 /X) + {l-p t ){\ 



7/A, 



(ID 



times bias (A). 
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Probabilistic control. We need to upper-bound the largest eigenvalue of ^J> where 

/ is a random subset of {1, . . . , n} of cardinality p. This is the difference between an empirical 
second-order moment and the empirical moment of a subset of p random elements. In order to 
apply LemmaQ] we need to upper-bound the squared ^-norm as (assuming 7 A): 

max ($(-$ T $ + 7/)" 1 $ T ).. 



max f*^ 7 ! 

ie{l,...,n} v ' 



ie{l,...,n} 



n 



1 



A7- 1 max ($(-(A7" 1 )$ l $ + AI)- 1 $ l ) 
ie{i,...,n} n /! 



1 



< A7 1 max ($(—<!> 1 <1> + XI) x $ 1 ) .. because 7 5^ A, 

ie{i,...,n} n ,n 

= nA7~ 1 1| diag (K(K + nXI)" 1 ) ^ = Xj^d. 

Thus for 7 ^ A, all rows of have a squared ^-norm upper-bounded by A7 _1 <i, and i<J/ T <I/ =<; /, 
we can apply LemmaQ} to obtain that: 



Pt = li A 



l n 



> t ^ n exp 



-pi 2 /2 



A7- 1 d + i/3 



Ad 



Using the bound from Eq. (TTTb . we get, given 5 G (0, 1), t = 1/2, and 7 - 



i? 2 r 
= 1 + — P* + (i-ft) (1 

ni? 2 f -p/8 
H ; — exp 



A 
nR 2 

^ 1 H r— exp 

A 

ni? 2 

^ H — exp 

A 

nR 2 

^ H ; — exp 



4d/5 + 1/6 

~P 
Z2d/5 + 2 

~P 
32d/<5 + 2 

~P 



A ^32^/5 + 2 



7/A, 

+ 
+ 

+ 5 



1 -1 

(1 - J/2)" 2 

5 - 5 2 /4 

(1 " V2) 2 
1 -(5/4 

(1 " V2) 2 
3/41 
1/4 



exp 



V32(i/5 + 2 



Thus, if p > + 2) log we obtain that B ^ 1 + 45. 



Thus, 



+ 35. 



Ej [bias(L) + variance(L)] ^ Ej [bias(L 7 ) + variance(-ff)] by monotonicity 

= E/ [bias(L 7 )] + variance(K) 

(1 + 4<5)bias(-FT) + variance^) 
5^ (1 + 45) [bias(i^) + variance(i^)] , 

which is the desired result. Note that 



- We could improve the bound by expliciting the reduction of the variance term. 

- In some situations, the prediction performance for the approximated version may in fact be 
smaller than the non-approximated version. 
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D Asymptotics of bias and variance terms 

In this appendix, we consider various decays of eigenvalues n/ij of K and components -JrWi (in 
magnitude) of the signal z to estimate. We follow the reasoning of ll35l (i.e., replacing sums by 
integrals). Given our assumptions, we have: 



For all cases we need to consider, for simplicity, we only provide an upper-bound for /Zj exactly 
equal to its asymptotic equivalent. Considering lower-bounds and a constant times the asymptotic 
equivalent may be done in a similar way. 

D.l Variance terms 

We consider the two possible cases (the variance term only depends on (/ij)). Moreover we show 
that the two traditional definitions of the degrees of freedom, tr K{K + nAI) -1 and XxK 2 (K + 
n\I)~ 2 , have the same asymptotically equivalents. 

Polynomial decay (/ij = i~ 2 @, j3 > 1/2). The renormalized variance term is less than 



n 



n 




n 

—^■variance 
a* 





W u i/2P 1 du with the change of variable u = At 2/3 , 




0(A 1 / 2/3 ) since the integral is finite. 



With the same reasoning, we have tr K(K+nXI) 1 ^ j { 



1 



(l+u) 
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Exponential decay (fa = e p% ). The renormalized variance term is less than 



-(1 + e^A) 2 ^ J (1 + e^A) 2 * " J (e^' + A) 2 ' 
1 r 1 « 



P J e -pn (« + A) 2 



du with the change of variable u = e 



-pt 



v p J \u + X ~ {u + X) 2 J U ^p~J u + X U 

= -[log(l + A)-logA] = 0(\og\). 
p L J A 

We the same technique, we get bounds on tr K(K + nXI)~ l in the same way we just did for 

tvK 2 (K + nXI)- 2 . 

D.2 Bias terms 

The bias terms depend on both (fa) and (z/j) and we consider all combinations. 
Polynomial decays (fa = i~ 2 ^ ', = i~ 2S , (3,6 > 1/2). The bias term is less than 



If 25 — 4/3 > 1, then we have an upper bound of 2nA 2 t 4/3 25 dt = 0(nX 2 ), because the integral 
is finite. 



If 25 — 4/3 < 1, then we can further bound Eq. (1121) as 

2nA 2 / xoTTxdt = 2nA 2 / ; ^ — du 



i (1 + t^A) 2 A (1 + u) 2 2/3 

with the change of variable u = Xt 2 ^ 

= O(X^W) U ; ^d« = Q(A( M -W), 

Jo + 2/5 

because the integral is finite (due to the assumptions made on /3 and 5). 



Exponential decays (fa = e p % i/j = e Kl , p,K > 0). The bias term is less than 

- (1 + e^A) 2 nA A (TTe^Ap 

nA /- Aenp (-u/A) 1 -^ 



72.A ^ 

i 



/ — ; T-^—du with the change of variables u = Xe p . 

Jx (1 + ^) 2 



19 



If n/p > 2, then we have a bound 

nX 2 f°° u 1 ~ k /p 



du = 0(nX 2 ), 



P Ji (1 + Xuf 
because the integral is finite and uniformly bounded in A. 
If k/p < 2, then we have a bound 

nX K/ P f Xe"P U i- K / P nX K l p f°° u 1 ~ k Ip , nl xlf/o . 

r d« ^ / - -^du = 0(nX K/p ). 



P Jx (l + u) 2 p J (l + u) 

Mixed decays. For pi with polynomial decays and v; t with exponential decays, we are in a situ- 
ation where V{ is decaying fast enough (faster than i~ 2S for any 5 > 1/2) so that, given previous 
results, the bias is nX 2 . 

The only remaining result to show is pi = e~ pt and Vj, = i~ 2S , 5 > 1/2, which we now consider. 
The bias term is equal to 



n n ._nx 

nJ ' 



{ (Pi + A) 2 ^ ( e -P* + A)s 



z- 2<5 r 25 



nA 2 : t-tt + nA 2 



(e-^ + A) 2 (e-^ + A) 2 

j^ilogA^ 1 n^i>| log A -1 

;-25 ;-2<5 

< nX2 E ^+" a2 E 

i^^logA- 1 ri£i>- log A" 1 

< n > ? + n 1 

isS^logA" 1 O^logA- 1 
p ° p 

= 0(n) + 0(n(log A" 1 ) 1 - 25 ) = 0(n(log A" 1 ) 1 - 25 ). 



D.3 Optimal regularization parameters 

We can now take all six cases, and compute the optimal A and the resulting optimal regularization 
error. 

- pi = i~ 2/3 , Vi = i~ 2S (25 > 4/3 + 1): we need to minimize with respect to A the function 

n -\ X -l/2f) + A 2 ; wWch leads tQ A _ ^-1/(2+1/2/3) and an optimal value f n l/(4/9+l)-l_ 

- Pi = i~ 2/3 , Vi = i~ 2S (25 < 4/3 + 1): we need to minimize with respect to A the function 
n _i A _i/ 2/ 3 + A (25-i)/2^ ) which leads tQ A _ n -/3/S and an p t i ma i va i ue Q f nVC 2 *)- 1 . 

- Pi = i~ 2/3 , Vi = e~ Kl : same computation as the first one. 

- pi = e~ pl , Vi = i~ 2S : we need to minimize with respect to A the function rT x log j + 
(log t) 1_2(5 , which leads to log t ~ n 1 / 25 and an optimal value of n 1 / 25-1 . 
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- fii = e p \ vi = e K% (k > 2p): we need to minimize with respect to A the function 
n" 1 log j + A 2 , which leads to A m n~ 1 / 2 and an optimal value of log n/n. 

- Hi = e~ pl , Vi = e~ Kl (k < 2p): we need to minimize with respect to A the function 
n _1 log j + A K / p , which leads to A m n~ p l K and an optimal value of log n/n. 



E Kernels on [0,1] 



In this appendix, we consider kernels on X = [0, 1] that lead to closed-form expressions (or asymp- 
totic equivalents) for eigenvalues of K and components of z. These are used in simulations. 



Kernels. For a positive summable sequence (jUj)j^i, we consider k(x, y) = Y^=\ ^ cos 2m{x— 
y). It is defined for any (x, y) G [0, l] 2 and is 1-periodic in x and y. It is a function g of x — y — 
[x — y\ , i.e., k(x, y) = g{x — y — [x — y\ ). Moreover k(x, x) is independent of x. 

For ^ = j^p, we have k(x, y) = j^yB 2 [s(x — y — [x — y\ ), where B 2 p is the (2/3)-th Bernoulli 
polynomial ll24l . For example, we have B 2 (x) = x 2 — x+^ and Bq(x) = x 6 — 3x 5 + |x 4 — \x 2 + ^. 

For ^ = e~ pi , we have, k(x, y) = 2^^%^. Indeed, we have 

' ' \ i n j e ip — 2eP cos 2ir(x— y)+l 

/ oo 

k(x,y) 



Re(j22e- pi+2iuJ7T{x -A with u 2 = -1, 
^ i=i ' 

2Re( yS-P+^-y))) = 2Re ( — J ( A 

y ^ j V i — e^ p+2ujn( . x ~y^ J 

2Re( , } , ) = 2 2 e P ™Mx-y)-l 



Data and eigenvectors. If we consider n data points X{ = i = 1, . . . , n, then the kernel 
matrix K has components Kij = k(^—^-, ^-)- It is a circulant matrix, thus it is diagonalisable in 
the discrete Fourier basis ||36l , with eigenvalues equal to the discrete Fourier transform of the first 
column of the matrix, i.e., (g(0), g(l/n), . . . ,g(l — l/n)) T . 



21 



Thus, the z-th eigenvector has j-th component ■^=e 2u% ^ 1 ) 7T / n (with oj 2 = 01) and the i-th eigen- 
value is 

n 

Xi = ^e- 2 ^- 1 )-/"< 7 ((j-l)/n) 

3=1 

n oo 

= e~ 2 ^- x ^l n Y 2/i. cos 2svr(j - 1) jn 

3=1 s=l 



^ e -2ui(3'-l)?r/n ^ ^2soj7r(j-l)/n. _|_ e -2sw7r(j-l)/nj 

j=l «=1 

oo 

n |U S [5 s= j[ n ] + <J- S =i[n]] because of the orthonormality of the Fourier basis, 



n(j,i + n ^ Vi+hn + n ^ M-t+fcn- 

ft=l h=l 



If ra is large and ^ tends to zero when i tends to +oo, then an asymptotic equivalent for Aj is n/ij. 
For data sampled from the uniform distribution in [0, 1], then similar equivalents hold (see, e.g., [35]]). 



Functions. Let f(x) = YnLi 2z V 

cos 2m x, for U{ a non-negative summable sequence. We 
consider z% = f(xi) = f((i — l)/n). The component of z on the z-th eigenvector of K is (following 
the same reasoning as above): 

£^ e -^Q-iW" /((i _i )/n) 

3=1 

(OO OO \ 

V l /2 + S ^ An + Yl U -i+hn ) > 
h=l h=l ' 

and the asymptotic equivalent is (nz/j) 1 / 2 . 



Link with Sobolev spaces. The kernel k(x, y) defined above corresponds for ^ = i~ 2 ° to certain 
Sobolev spaces |[24ll26l . Indeed, for (3 integer, the associated RKHS is the Sobolev space of periodic 
functions which are /3-times differentiable. 

Moreover, when z/j = i~ 2S , then for 5 > Sq, then the corresponding function is (5q — l/2)-times 
differentiable, and the minimax rate of estimation is known to be exactly 0(n 1//2<5 °) |[37ll29ll . Thus, 
up to logarithmic terms, the best possible rate is 0(n 1 / 2 ' 5 ), and is achieved if /3 is large enough (see 
Section 4.3 of the main paper). 
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